ABSTRACT

Outlier detection aims to identify unusual data instances
that deviate from expected patterns. It is particularly
challenging when outliers are context dependent and when
they are defined by unusual combinations of multiple outcome
variable values. In this paper, we develop and study a new
conditional outlier detection approach for multivariate
outcome spaces that works by (1) transforming the conditional
detection problem into an outlier detection problem in a new
(unconditional) space and (2) defining outlier scores by
analyzing the data in the new space. Our approach relies on
the classifier chain decomposition of the multi-dimensional
classification problem, which lets us transform the output
space into a probability vector, with one probability for
each dimension of the output space. Outlier scores applied
to these transformed vectors are then used to detect the
outliers. Experiments on multiple multi-dimensional
classification problems with different outlier injection
rates show that our methodology is robust and able to
identify outliers both when they are sparse (manifested in
one or very few dimensions) and when they are dense
(affecting multiple dimensions).

Categories and Subject Descriptors 

I.2 [Artificial Intelligence]: Applications and Expert Systems

General Terms 

Conditional outlier detection 

Keywords 

Conditional outlier detection, multivariate data modeling

1. INTRODUCTION 

Outlier detection is one of the most active research topics
in data mining and statistics. The objective of outlier (or
anomaly) detection is to find unusual data instances in a
dataset. Outlier detection can be extremely useful for
identifying atypical data or behaviors, unusual outcomes, or
erroneous readings and annotations. It is often used as a
primary data preprocessing step that helps remove noisy or
irrelevant signals from a dataset. More often, however, it
is used to identify interesting (rare) patterns in data that
may be associated with either adverse or beneficial events,
as in novelty detection, fraud identification, network
intrusion surveillance [58], disease outbreak detection, and
clinical monitoring and alerting.

Despite huge progress in outlier detection methodologies,
the majority of existing methods aim to detect unconditional
outliers that are identified over the joint space of all
data attributes. However, these methods are not suitable for
many practical problems in which we want to identify unusual
(or out of the ordinary) responses (labels) associated with
data objects. In such cases, outliers depend on the context
or properties of the data objects we consider, and applying
unconditional methods may easily lead to both false positive
and false negative detections. Consider, for example, an
image annotation (labeling) problem in which we want to
detect erroneous image tags. If we applied an unconditional
outlier detection approach to this problem, images with rare
subjects would be flagged even if their annotations are
correct, simply because the subjects are scarce in the
dataset. Similarly, consider a patient with a rare disease.
Even though the patient's diagnoses are correct for the
manifested symptoms, unconditional outlier detection would
incorrectly mark the case as an outlier because the disease
is rare. On the other hand, assume an unusual image, say of
some modern painting, is assigned a label that is frequent
across the database of images but incorrect for that
specific image or style. In such a case, the label itself is
not an outlier when considered without context, but it
becomes one when the proper context is considered. Similarly,
a moderately high medication dose may look frequent with
respect to a patient population that includes both adults
and children, but it may become abnormal when considering
only children.

The differences between unconditional and conditional
outlier detection become apparent when both problems are
expressed probabilistically. In conditional outlier detection
we seek instances that fall into a low probability region of:

P(y|x) = P(y,x)/P(x)

where y is the response (outcome) vector and x is a data
object defining the context. In contrast, the unconditional
outlier detection approach seeks instances in low probability
regions of P(y,x) or P(y).
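
The contrast can be made concrete with a small sketch built on the medication dose example. The counts below are hypothetical and chosen only for illustration; the function names are likewise assumptions of this sketch.

```python
from collections import Counter

# Hypothetical toy records: (context, response) = (age group, dose level).
records = (
    [("adult", "high")] * 40 + [("adult", "low")] * 40
    + [("child", "low")] * 19 + [("child", "high")] * 1
)

n = len(records)
joint = Counter(records)                    # counts of (x, y) pairs
context = Counter(x for x, _ in records)    # counts of x
response = Counter(y for _, y in records)   # counts of y


def p_response(y):
    """Unconditional estimate P(y)."""
    return response[y] / n


def p_conditional(y, x):
    """Conditional estimate P(y | x) = P(y, x) / P(x)."""
    return (joint[(x, y)] / n) / (context[x] / n)


print(p_response("high"))               # frequent over the whole population
print(p_conditional("high", "child"))   # rare once the context is children
```

In this toy population a high dose is common overall but falls in a low probability region once we condition on the context "child", which is the case conditional outlier detection is meant to catch.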

The focus of this paper is on the development of conditional
outlier detection methodologies in which data objects are
associated with multivariate (possibly high dimensional)
binary outputs (responses) and our goal is to identify
irregularities or rare patterns in these responses. Typically
the multivariate binary outputs correspond to label spaces.
Examples of problems that fall in this category are the
identification of unusual labelings of images, unusual
keywords assigned to documents, or incorrect diagnoses
associated with a patient case. Conditional outlier detection
is particularly challenging in these settings: both the
context and the interdependences in response patterns should
be considered when detecting the outliers.

The approach we propose in this work builds upon the
probabilistic classifier chain model for multidimensional
prediction problems. The model represents the posterior
probability P(y|x) by decomposing it into a product of
univariate probabilistic predictors
$P(y_i \mid \mathbf{x}, \mathbf{y}_{\pi(i)})$, one for each
output variable, each of which depends on x and on the values
of some other variables in y, denoted $\mathbf{y}_{\pi(i)}$.
These univariate models can be represented and learned using
a variety of classic discriminative methods. Briefly, each
term of the product represents the probability of observing
one dimension of the output space. While one can always
calculate the product of these terms to express the full
posterior P(y|x) (via the chain rule), our approach treats
all terms (in vector form) as a new representation of the
output space that accounts for both the context-output and
output-output dependences. Our assumption is that different
outlier methods and outlier scores can be successfully
defined in this new space. The reason for keeping the terms
separate is twofold. First, the errors due to various model
estimation procedures are not combined into one statistic,
which can make the detection of true irregularities
(outliers) hard, especially for high dimensional y. Second,
the decomposition lets us adapt the detection procedure to
different types of outliers. For example, when outlier
instances are expected to affect only one or just a few
dimensions of the output space, the outlier scoring in the
new space may focus on a different statistic derived from
the individual terms than the statistic one would need when
outliers are dense and affect many different outputs. For
example, in image labeling one may assume the process
generating outliers is random and rare (e.g., a label is
randomly added or omitted), and hence the chance of seeing
outliers in multiple dimensions of y is small. On the other
hand, some outliers (such as those in network attacks) are
expressed over many dimensions of the output space. Keeping
the space decomposed but still covering key contextual and
output dependences helps us detect outliers more effectively
in these different settings.
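
The first point can be illustrated with a short numerical sketch. The per-dimension probabilities below are hypothetical values chosen for illustration, not estimates from any dataset in this paper.

```python
import math

d = 100  # number of output dimensions (hypothetical)

# A normal instance whose per-dimension terms are all moderately
# uncertain, e.g. because of model estimation error.
normal = [0.90] * d

# A sparse outlier: every term is very likely except a single one.
sparse = [0.99] * (d - 1) + [0.01]

joint_normal = math.prod(normal)   # single combined statistic
joint_sparse = math.prod(sparse)

worst_normal = max(1.0 - p for p in normal)   # per-dimension view
worst_sparse = max(1.0 - p for p in sparse)

print(joint_normal, joint_sparse)   # the product ranks the outlier as MORE likely
print(worst_normal, worst_sparse)   # the vector view exposes the unlikely term
```

Under these assumed values the product of the terms assigns a higher joint probability to the sparse outlier than to the normal instance, whereas the decomposed vector still exposes the single very unlikely response.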

We propose and test different outlier criteria defined upon
the new output space that captures context-output and
output-output dependences. The experiments are conducted on
a number of multi-dimensional classification datasets, with
different outlier processes injecting errors into the output
spaces. We demonstrate that our methodology is robust and
able to detect outliers both when the outlier signal is
sparse (manifested in one or very few output dimensions) and
when it is dense (affecting multiple dimensions).

The rest of the paper is organized as follows. Section 2
formally defines the multivariate conditional outlier
detection problem we are investigating. Section 3 reviews the
related research work. Section 4 describes the new outlier
detection approach. Section 5 presents the experimental
results and evaluations. The final section concludes the
paper.

2. PROBLEM DEFINITION 

This section provides the formal definitions and notation
for the multivariate conditional outlier detection problem
addressed in this paper. In particular, we consider a special
case of the problem where each data instance is associated
with d discrete-valued response variables
$\mathbf{Y} = (Y_1, \ldots, Y_d)$. We are given training data
$D_{\mathrm{train}} = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$,
where each observation (context)
$\mathbf{x}^{(n)} = (x_1^{(n)}, \ldots, x_m^{(n)})$ is
associated with d response (output) variables
$\mathbf{y}^{(n)} = (y_1^{(n)}, \ldots, y_d^{(n)})$. Our goal
is to identify unusual responses that reside in (unseen)
testing data
$D_{\mathrm{test}} = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n}$.

The fundamental challenges in building a multivariate
conditional outlier detection model are how to build an
accurate model representing the dependency of the response
variables y on the context variables x, and how to capture
the mutual dependences among the response variables. We
approach this problem by modeling P(Y|X). However, the size
of this representation is exponential in the dimensionality
d of the output space; hence, one of the key questions is
how to reduce the complexity of this model.
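
As a rough worked illustration of this growth, assume binary responses (an assumption made here only for the arithmetic). A full table for P(Y|X = x) at a single context x then needs one entry per response configuration:

```latex
\#\{\text{configurations of } \mathbf{y}\} = 2^{d},
\qquad
\#\{\text{free parameters per context}\} = 2^{d} - 1 .
% Example with d = 27 binary responses:
2^{27} - 1 = 134{,}217{,}727 .
```

Even a modest output dimensionality therefore rules out representing the joint conditional distribution explicitly.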

Notation: For notational convenience, we will omit the index
superscript when it is not necessary. We may also abbreviate
expressions by omitting variable names; e.g.,
$P(Y_1 = y_1, \ldots, Y_d = y_d \mid \mathbf{X} = \mathbf{x}) = P(y_1, \ldots, y_d \mid \mathbf{x})$.

3. RELATED RESEARCH 

Outlier detection has been studied extensively by the data
mining and statistics communities. Accordingly, a variety of
approaches have been proposed and applied to identify
outliers in data and data streams. While outlier detection
studies have been conducted by a wide range of communities,
the concept is ill-defined, and there is no general consensus
on the definition of an outlier. Probably the most cited
definition is the one given by Hawkins: "An outlier is an
observation which deviates so much from the other
observations as to arouse suspicions that it was generated
by a different mechanism." Given this rather broad
definition, various methods were proposed to find the most
deviating instances in a multivariate dataset. The methods
can be roughly divided into four groups: depth-based
approaches, distance-based approaches, density-based
approaches, and high-dimensional approaches.

Depth-based approaches assume that outliers are at the
fringe of the response space and normal responses are close
to or in the center of the space. Typical algorithms in this
class include Exploratory Data Analysis, Isodepth [44], and
Fast Depth Contours. These methods define the depth k of the
data by gradually removing data from convex hulls, and data
samples with small depth are reported as outliers. A related
method is the One-Class Support Vector Machine, which
assumes all the training data belong to one class. The
resultant decision boundary then defines the region of
normal data, whereas the data that lie across the boundary
are considered outliers.

Density-based approaches assume that the density around a
normal data example is similar to the density around its
neighbors. Local outlier detection and isolation methods are
common examples. Compared with the other approaches,
density-based approaches are more locally sensitive and tend
to achieve better accuracy. A typical representative is the
Local Outlier Factor (LOF), which is a relative density score
estimated by an extended k-nearest neighbor approach. LOF
indicates the unusualness of an instance and can be used as
an outlier index. This density-based approach has shown good
performance in many applications and has influenced several
subsequent works in the literature.

Distance-based approaches assume that normal data examples
come from dense neighborhoods, while outliers correspond to
isolated points. A typical method, which is one of the early
outlier detection methods and is still used in many
applications, gives an outlier score to each instance using
a robust variant of the Mahalanobis distance. The score
measures the distance between each instance and the main
body of the data distribution, such that instances located
far from the remaining instances can be identified as
outliers. Other methods that fall in this category include
Knorr's unified approach, the linearization method, the
randomized pruning method, and the resolution-based method.
The limitation of the distance-based methods is that they
suffer from the curse of dimensionality. The number of
parameters in those models increases quadratically in the
number of dimensions, which makes them less suitable for
high dimensional data.

In a high-dimensional space, one of the greatest challenges
is that the data samples are so sparse that there is no
meaningful neighborhood. High-dimensional approaches are
proposed to handle such extreme cases. Typical methods in
this class either adopt an invariant distance measure, such
as angle-based outlier detection, or project the data to a
lower dimensional subspace, such as grid-based subspace
outlier detection, sufficient dimensionality reduction,
Bayes Exponential Family PCA, and Sparse PCA. More recent
methods use Gaussian processes to help matrix factorization
or explore the structure between independent data.

The vast majority of existing outlier detection methods
attempt to solve the "unconditional" outlier detection
problem, where data instances are compared and analyzed
across all attributes. On the other hand, an increasingly
popular approach in recent years is conditional (or
contextual) outlier detection, which attempts to identify
outliers in a subset of response variables given the values
of context variables. Several approaches [47, 18, 51] have
been proposed to this end. Song et al. [47] proposed a
model-based conditional outlier detection method that uses a
generative data representation to capture the conditional
relations between context and response variables, and
considers the instances that deviate from this
representation as outliers.

Although our proposed solution shares some similarities
with Song et al. [47], there are significant differences:

(1) To model the underlying data representation, our
approach uses a multi-dimensional learning approach that
directly learns the conditional probability distribution (a
discriminative model). In contrast, [47] uses Gaussian
mixture models (GMMs) to learn the distributions P(x) and
P(y) separately, and the conditional properties are modeled
through a probabilistic mapping function.

(2) The parameter learning in our approach exploits the
chain decomposition, which reduces the multivariate
conditional modeling to the learning of d classification
functions and makes the method scalable to large data. In
contrast, the learning of GMMs in [47] requires expensive
Expectation-Maximization steps, which limits its
scalability.

(3) In outlier detection on testing instances, our approach
estimates and utilizes the piecewise posterior probability
of individual responses $P(y_i \mid \mathbf{x})$, which not
only improves the outlier detection performance to a
significant extent, but also makes the method sensitive to
low-dimensional (sparse) outliers. In contrast, the GMMs
used in [47] are only able to compute the conditional joint
probability P(y|x) (estimating $P(y_i \mid \mathbf{x})$ is
computationally infeasible).


4. MCODE MODEL 

This section describes MCODE, our multivariate conditional
outlier detection approach. Briefly, we present a
model-based outlier detection technique that learns a data
model from a training dataset, which is assumed to be
outlier-free (or the effect of outliers on modeling is
assumed negligible), and then uses the model to detect
outliers in unseen data, which may include outliers.
Accordingly, the proposed approach consists of the following
two phases: (1) We first build a probabilistic multivariate
conditional model from the training data. (2) The model,
when it is applied to different data instances, is used to
estimate outlier scores that measure how likely or unlikely
the new data patterns are under the trained model. Sections
4.1 and 4.2 describe these two phases in more detail.


4.1 Conditional Probabilistic Models of Multivariate Outputs

Our outlier detection approach summarizes the data by a
model which is then used for outlier detection. Our first
step is therefore to build (from data) an accurate
probabilistic model relating the context variables
$\mathbf{X} = (X_1, \ldots, X_m)$ defining the different data
objects and the output variables
$\mathbf{Y} = (Y_1, \ldots, Y_d)$ defining the response. More
specifically, we want to learn an accurate predictive
probabilistic model P(Y|X).

The problem of learning P(Y|X) from data has been studied
extensively in the context of multi-dimensional learning
(MDL), where the goal is to learn P(Y|X) and use it to
support multivariate classification tasks, such as
automatically assigning tags to new images [40], keywords or
topics to text documents [56], different functions to genes,
and/or diseases to patients. The assignment task corresponds
to finding the maximum a posteriori (MAP) assignment of the
response variables:


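$$\mathbf{y}^{*} = \arg\max_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \qquad (1)$$

$$\phantom{\mathbf{y}^{*}} = \arg\max_{y_1, \ldots, y_d} P(Y_1 = y_1, \ldots, Y_d = y_d \mid \mathbf{X} = \mathbf{x}) \qquad (2)$$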


However, we note that for the purposes of conditional
outlier detection, we are not interested in using the model
to find the optimal assignment. Instead, we are interested
in assessing how likely the observed context-output
assignment is.












[Figure 1: graphical models, panels (a) DBR and (b) BR]

Figure 1: A comparison of Dependent Binary Relevance (DBR)
and Binary Relevance (BR) models in graphical representation
(d = 3).


The key challenges in learning P(Y|X) are that (1) X can be
a complex high dimensional space defined by a mixture of
discrete and continuous context variables, and (2) the
number of possible assignments of values to the output
variables is exponential in d. Many machine learning
solutions exist that address the first problem, for example,
various discriminative classification techniques enhanced
with feature regularization. The second problem is equally
important: it is infeasible to model and learn all possible
output assignments independently.

A simple solution to the output space problem is the Binary
Relevance (BR) method, which assumes all responses Y are
conditionally independent of each other given the context X,
and learns d functions separately. However, this may not
suffice for many real-world modeling tasks where the
dependences among the responses hold important information
for building an accurate model.

To introduce dependences among outputs, the Classifier
Chains (CC) approach defines a multi-dimensional model of
the response variables by decomposing them via the chain
rule into a product of univariate conditional models, one
model for each variable of the output space. Briefly, the CC
framework decomposes the multivariate conditional
distribution P(Y|X) into a product of the posteriors over
the individual response variables $(Y_1, \ldots, Y_d)$ as:

$$P(Y_1, \ldots, Y_d \mid \mathbf{X}; M) = \prod_{i=1}^{d} P(Y_i \mid \mathbf{X}, \mathbf{Y}_{\pi(i,M)}) \qquad (3)$$

where $\mathbf{Y}_{\pi(i,M)}$ denotes the parents of $Y_i$
(in other words, the output variables $Y_i$ directly depends
on) in a model M. The framework exploits the decomposable
structure of the underlying dependency relations among the
response variables Y, which is represented in M. Note that
this representation generalizes BR, which is recovered by
assuming M does not define any relations among the output
components (i.e., $\mathbf{Y}_{\pi(i,M)} = \{\}$, the empty
set).

A related decomposition scheme is the Dependent Binary
Relevance (DBR) model. This model does not adhere to the
chain rule decomposition of the joint of the output space
P(Y|X), and it permits circular dependences among output
variables. Hence it is best viewed as an approximation of
P(Y|X), that is,

$$P(Y_1, \ldots, Y_d \mid \mathbf{X}; M) \approx \prod_{i=1}^{d} P(Y_i \mid \mathbf{X}, \mathbf{Y}_{\pi(i,M)}) \qquad (4)$$

where

$$\mathbf{Y}_{\pi(i,M)} = \mathbf{Y} \setminus Y_i = (Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_d) \qquad (5)$$

Figure 1 shows the graphical representation of DBR and BR
when the number of response variables is 3. Compared with
BR, DBR considers the status of all the other response
variables in representing the data.

We note that our outlier detection approach can be defined
for, and can work with, many different models that fit the
CC-like product decomposition.


4.1.1 Learning 

The parameter learning of DBR corresponds to specifying the
conditional probability distribution (CPD) of each response
variable $Y_i$:

$$P(Y_i \mid \mathbf{X}, \mathbf{Y}_{\pi(i,M)}) = P(Y_i \mid \mathbf{X}, Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_d)$$

To represent the individual CPDs, we use probabilistic
predictive functions, such as logistic regression, support
vector machines with probabilistic outputs, or naive Bayes.
In this work, we use logistic regression with L2
regularization.

Notice that each $Y_i$ depends on the rest of the response
variables $\mathbf{Y} \setminus Y_i$, so the order in which
the CPDs are learned does not play an important role in
model building.
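
A minimal sketch of this learning step in plain Python, assuming binary responses. The helper names (`fit_logistic_l2`, `fit_dbr`, `dbr_probability_vector`), the gradient-descent settings, and the toy data are illustrative assumptions, not the implementation used in the experiments.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic_l2(rows, targets, lam=0.1, lr=0.5, epochs=500):
    """Batch gradient descent for L2-regularized logistic regression.
    Returns the weights; the last entry is the unregularized bias."""
    m, n = len(rows[0]), len(rows)
    w = [0.0] * (m + 1)
    for _ in range(epochs):
        grad = [0.0] * (m + 1)
        for row, t in zip(rows, targets):
            err = sigmoid(w[-1] + sum(wj * xj for wj, xj in zip(w, row))) - t
            for j, xj in enumerate(row):
                grad[j] += err * xj
            grad[-1] += err
        for j in range(m):
            w[j] -= lr * (grad[j] / n + lam * w[j])
        w[-1] -= lr * grad[-1] / n
    return w


def fit_dbr(X, Y, **kwargs):
    """One CPD per output Y_i, trained on (x, all other responses)."""
    d = len(Y[0])
    models = []
    for i in range(d):
        rows = [list(x) + [y[j] for j in range(d) if j != i]
                for x, y in zip(X, Y)]
        models.append(fit_logistic_l2(rows, [y[i] for y in Y], **kwargs))
    return models


def dbr_probability_vector(models, x, y):
    """p_i = P(Y_i = y_i | x, y without y_i) for the observed labels y."""
    d, p = len(y), []
    for i, w in enumerate(models):
        row = list(x) + [y[j] for j in range(d) if j != i]
        p1 = sigmoid(w[-1] + sum(wj * xj for wj, xj in zip(w, row)))
        p.append(p1 if y[i] == 1 else 1.0 - p1)
    return p


# Hypothetical toy data: both labels are on exactly when x > 0.
X = [[-2.0], [-1.0], [1.0], [2.0]]
Y = [[0, 0], [0, 0], [1, 1], [1, 1]]
models = fit_dbr(X, Y, lam=0.01)
p_ok = dbr_probability_vector(models, [2.0], [1, 1])    # consistent labels
p_bad = dbr_probability_vector(models, [2.0], [0, 1])   # first label flipped
```

Because each CPD conditions on all the other responses, the d models can be fitted in any order (or in parallel), which matches the remark above about the order of learning.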

4.1.2 Complexity 

Supposing we use logistic regression as our base
probabilistic representation, we need
$d(m + (d - 1) + 1) = O(dm + d^2)$ parameters for a DBR
model. Learning these parameters requires $O(d)$ estimations
of $P(Y_i \mid \mathbf{X}, \mathbf{Y}_{\pi(i,M)})$. Hence,
the overall complexity of learning a DBR model is $O(d)$
times the complexity of learning logistic regression.
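
The parameter count can be checked with a short sketch. The function name is an assumption of this sketch; the example dimensions are the (m, d) values reported for Mediamill and Genbase in Table 2.

```python
def dbr_param_count(m: int, d: int) -> int:
    """Number of logistic regression parameters in a DBR model.

    Each of the d models has m context weights, d - 1 weights for the
    other responses, and one bias term.
    """
    return d * (m + (d - 1) + 1)


print(dbr_param_count(120, 101))    # Mediamill: m = 120, d = 101
print(dbr_param_count(1185, 27))    # Genbase:   m = 1,185, d = 27
```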

4.2 Identifying Outliers 

The previous section described how to efficiently learn and
represent multivariate data using the DBR model. In this
section, we describe how to apply the model to unseen
testing data and identify the multivariate conditional
outliers residing in them.

Our objective in the second phase is to estimate the degree
of “outlier-ness” of unseen data instances using the trained
model from the first phase. That is, we would like to define
effective scoring metrics for model-based outlier detection.
An important advantage of DBR towards this objective is that
it gives a well-defined model of the posterior response
probability. Recalling equation (4), DBR allows an efficient
estimation of the pseudo-likelihood P(Y = y|X = x) for any
(x, y) pair. In addition, by exploiting the decomposable
structure of the model, we can easily estimate the
likelihood of each individual response $y_i$ given its
context x; i.e., $P(Y_i = y_i \mid \mathbf{X} = \mathbf{x})$.
Namely, given an observation x, how likely or unlikely the
individual responses $y_i$ are is quantified in a
d-dimensional vector.

We hypothesize that this piecewise posterior probability of
individual responses contains crucial information for
identifying multivariate conditional outliers, and propose a
new outlier detection method along with a set of outlier
scoring metrics. More specifically, our method first
transforms the testing data from its original space to the
probability space, using the DBR model obtained in the
previous phase. It then estimates the multivariate outlier
scores using the conditional quantities in the new space.

Although existing model-based conditional outlier detection
methods have attempted a similar approach, they are limited
in that they only use the joint posterior probability P(y|x)
and assume the underlying distribution is Gaussian. As a
result, these methods become less sensitive to the outlying
patterns observed in individual dimensions, especially when
the dimensionality of the data is high, and only patterns
that deviate from the Gaussian distribution can be detected.
Our approach differs in that (1) it utilizes the likelihood
estimation on each response dimension to identify outliers,
and (2) it uses the DBR model (or the CCF models, in
general) to represent the data and does not assume a
Gaussian distribution. As a result, our proposed approach
drives the process of outlier scoring to a more granular
level of understanding and utilizing the conditional
behaviors in data, and leads to a significant performance
improvement in outlier detection.


4.2.1 Outlier Scoring Metrics 

In this subsection, we describe five outlier scoring metrics
that we use in our multivariate conditional outlier detection
approach. To recall, our objective is to measure the outlier
score of unseen testing data $D_{\mathrm{test}}$. For
notational convenience, let us first define a quantity
$\mathbf{p}^{(n)}$ of the n-th instance:

$$\mathbf{p}^{(n)} = \left(p_1^{(n)}, \ldots, p_d^{(n)}\right), \qquad p_i^{(n)} = P\left(y_i^{(n)} \mid \mathbf{x}^{(n)}, \mathbf{y}_{\pi(i,M)}^{(n)}; M\right) \qquad (6)$$

where $i = 1, \ldots, d$ and M denotes the underlying data
representation. Using this d-dimensional quantity
$\mathbf{p}^{(n)}$, below we define our outlier scoring
metrics.

| Type | Key Quantity | Metric |
|---|---|---|
| Univariate Metric | Complementary probability | $\mathrm{Score}_1 = 1 - P(\mathbf{y} \mid \mathbf{x})$ |
| Multivariate Metric | Robust distance | $\mathrm{Score}_2 = (\mathbf{p} - \bar{\mathbf{p}})^{\top} C^{-1} (\mathbf{p} - \bar{\mathbf{p}})$ |
| Multivariate Metric | $L_r$ norms | $\mathrm{Score}_3 = \lVert \mathbf{1} - \mathbf{p} \rVert_r$ |
| Multivariate Metric | Local outlier factor | $\mathrm{Score}_4 = \frac{1}{\lvert N_k(\mathbf{p}) \rvert} \sum_{o \in N_k(\mathbf{p})} \frac{lrd_k(o)}{lrd_k(\mathbf{p})}$ |
| Multivariate Metric | One-class SVM score | $\mathrm{Score}_5 = \mathbf{w} \cdot \phi(\mathbf{p}) - \sigma$ |

Table 1: Summary of the outlier scoring metrics. p denotes
the individual posterior response probability (equation (6)).

Score_1: Complementary Probability

The first outlier scoring metric is a univariate scoring
metric that uses the natural interpretation of probability.
That is, the metric takes an instance
$(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})$ and estimates its
complementary probability based on model M. Note that this
is a widely used outlier scoring technique.

$$\mathrm{Score}_1(\mathbf{x}^{(n)}, \mathbf{y}^{(n)}) = 1 - P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; M) \qquad (7)$$

Score_2: Robust Distance

The robust distance measures the deviation between each
instance and the main body of the distribution, using a
robust variant of the Mahalanobis distance. As a result, the
method can maintain a notion of normal data during the
process of outlier scoring.

$$\mathrm{Score}_2(\mathbf{p}^{(n)}) = \mathrm{robust.dist}(\mathbf{p}^{(n)}) = (\mathbf{p}^{(n)} - \bar{\mathbf{p}})^{\top} C^{-1} (\mathbf{p}^{(n)} - \bar{\mathbf{p}}) \qquad (8)$$

where $\bar{\mathbf{p}}$ denotes the mean of the vectors
$\{\mathbf{p}^{(n)}\}$, and C is a robust estimate of their
covariance matrix.

Score_3: L_r Norms

For the purpose of multivariate conditional outlier
detection, in general, we are more interested in the
responses whose likelihood is low. Using $L_r$ norms of
$\mathbf{1} - \mathbf{p}^{(n)}$, we increase the contribution
of such less likely responses to the outlier score through
the choice of the parameter r.

$$\mathrm{Score}_3(\mathbf{p}^{(n)}; r) = \lVert \mathbf{1} - \mathbf{p}^{(n)} \rVert_r \qquad (9)$$

In this paper, we report our results using $r = \infty$, such
that only the least likely response
($\max_i (1 - p_i^{(n)})$) decides the outlier score.

Score_4: Local Outlier Factor

Local Outlier Factor (LOF) uses a relative density score
estimated by an extended k-nearest neighbor approach:

$$\mathrm{Score}_4(\mathbf{p}^{(n)}; k) = \frac{1}{\lvert N_k(\mathbf{p}^{(n)}) \rvert} \sum_{o \in N_k(\mathbf{p}^{(n)})} \frac{lrd_k(o)}{lrd_k(\mathbf{p}^{(n)})} \qquad (10)$$

where $lrd_k(\mathbf{p}^{(n)})$ is the local reachability
density of $\mathbf{p}^{(n)}$, defined as:

$$lrd_k(\mathbf{p}^{(n)}) = \frac{\lvert N_k(\mathbf{p}^{(n)}) \rvert}{\sum_{o \in N_k(\mathbf{p}^{(n)})} \max\left(k\text{-}dist(o),\ dist(\mathbf{p}^{(n)}, o)\right)}$$

which in essence summarizes the density in the neighborhood
of $\mathbf{p}^{(n)}$. As a result, LOF estimates the
unusualness of an instance in consideration of its local
density, compared to the local densities of its neighbors.
For more technical detail and theoretical discussion, see
the original LOF work.
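
The first four scores can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the joint probability in the first score is approximated by the product of the entries of p, the mean and inverse covariance for the robust distance are assumed to be supplied by the caller (the robust estimation itself is not shown), and the LOF routine is a brute-force version that assumes no duplicate points and more than k points.

```python
import math


def complementary_probability(p):
    """Score 1: 1 - P(y|x; M), with P approximated by the product of p."""
    return 1.0 - math.prod(p)


def mahalanobis_sq(p, mean, cov_inv):
    """Score 2: (p - mean)^T C^{-1} (p - mean) for a given mean and C^{-1}."""
    diff = [a - b for a, b in zip(p, mean)]
    size = len(diff)
    return sum(diff[i] * cov_inv[i][j] * diff[j]
               for i in range(size) for j in range(size))


def lr_norm(p, r=math.inf):
    """Score 3: L_r norm of 1 - p; r = inf keeps the least likely response."""
    q = [1.0 - v for v in p]
    if math.isinf(r):
        return max(q)
    return sum(v ** r for v in q) ** (1.0 / r)


def lof(points, k):
    """Score 4: local outlier factor of every point (brute force, O(N^2))."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    neigh, kdist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]          # distance to the k-th neighbor
        kdist.append(kd)
        # Ties at the k-distance are kept, so |N_k| may exceed k.
        neigh.append([j for j in order if dist[i][j] <= kd])
    lrd = []
    for i in range(n):
        reach = sum(max(kdist[o], dist[i][o]) for o in neigh[i])
        lrd.append(len(neigh[i]) / reach)   # local reachability density
    return [sum(lrd[o] for o in neigh[i]) / (lrd[i] * len(neigh[i]))
            for i in range(n)]


# Hypothetical probability vectors: four similar ones and one far away.
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof(vectors, k=2)
```

In this toy example the isolated vector receives by far the largest LOF value, while the four clustered vectors score close to 1.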

Score_5: One-Class SVM Score

The last scoring metric relies on the One-Class Support
Vector Machine (OCSVM) technique. For training, OCSVM
assumes all the training data belong to one (normal) class
and attempts to find the maximum margin hyperplane between
the data and the origin. The following quadratic program
formulates the training of OCSVM [45]:

$$\min_{\mathbf{w}, \boldsymbol{\xi}, \sigma} \; \frac{1}{2} \lVert \mathbf{w} \rVert^2 + \frac{1}{\nu N} \sum_{n=1}^{N} \xi^{(n)} - \sigma \qquad (11)$$

$$\text{s.t.} \quad \mathbf{w} \cdot \phi(\mathbf{p}^{(n)}) \geq \sigma - \xi^{(n)}, \quad \forall n = 1, \ldots, N \qquad (12)$$

$$\xi^{(n)} \geq 0, \quad \forall n = 1, \ldots, N \qquad (13)$$

where the slack variables $\xi^{(n)}$ are used with the
parameter $\nu$ to control the smoothness. The resultant
decision boundary
$f(\mathbf{p}) = \mathbf{w} \cdot \phi(\mathbf{p}) - \sigma$
then defines the region of normal data, whereas the
instances crossing this boundary are considered outliers.
To estimate the outlier score on testing instances, we use
the raw output of OCSVM, which represents the location of
the instances relative to the decision boundary.

Table 1 summarizes the outlier scoring metrics discussed in
this section. After we obtain the outlier scores for the
testing data, we once again convert the scores to the
percentile rank of the instances. This step allows us to
evenly distribute the instances across the full range of the
outlier score, and lets us perform a more stable outlier
detection.
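
A minimal sketch of the percentile rank conversion. The exact rank definition is not specified in the text, so this version (the percentage of instances whose score does not exceed the given score) and the function name are assumptions of the sketch.

```python
from bisect import bisect_right


def percentile_ranks(scores):
    """Map each raw score to the percentage of scores that are <= it."""
    ordered = sorted(scores)
    n = len(scores)
    return [100.0 * bisect_right(ordered, s) / n for s in scores]


print(percentile_ranks([0.2, 0.9, 0.5, 0.5]))
```

Because only the ordering of the raw scores matters, the converted values are comparable across metrics whose raw ranges differ (for example a probability in [0, 1] and an unbounded distance).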



















5. EXPERIMENTAL RESULTS 

To validate our approach and demonstrate its effectiveness,
we present experimental results on real-world datasets. In
particular, in this section we would like to verify (1)
whether considering the conditional dependency among
response variables improves the performance in outlier
detection and (2) whether exploiting the piecewise
probabilistic estimation of individual responses is useful
in identifying outliers.

The evaluation of outlier detection performance, however, is
not straightforward. Because of the unsupervised nature of
the task, we do not know which outliers exist in a given
dataset. Therefore, we make the following assumptions before
we design our experiments.

• Outliers are the fallout of a conditional data generation
process that assigns to each observation (x) the most
probable response (y). Hence, outlying components are not in
the observation space but in the response space.

• The datasets we use in the experiments may contain a small
portion of outliers. These, however, do not affect the
comparison of methods in general, because the fraction is
too small to influence our model building process and the
resultant data representation.

• Although the process that generates outlying responses
can be neither known nor modeled, we assume we can create
outliers by perturbing the responses in the data.

Based on these assumptions, we conduct experiments that
consist of two parts. In Section 5.1, we consider a
realistic scenario where a fraction of the responses are
outlying when they are conditioned on their contexts. We
compare eight different outlier detection methods on six
real-world datasets, and show that our approach produces
competitive results. In Section 5.2, on the other hand, we
set up a controlled situation where we can adjust the number
of incorrect responses per outlier. Through experiments on
three real-world datasets, we show our approach is sensitive
to sparse outliers as well as to dense outliers.

5.1 Experiment 1 

5.1.1 Data 

In the first part of our experiments, we evaluate the
general performance of our outlier detection approach. We
use six multi-dimensional datasets obtained from multiple
domains.¹ These include semantic video/image labeling
(Mediamill [46], Corel5k), text categorization (Bibtex [26],
Reuters [32]), clinical patient classification (Medical),
and biology (Genbase). Each dataset consists of continuous
features, which represent the observation (context), and
associated binary labels, which represent the response.
Table 2 summarizes the characteristics of the datasets,
including dataset size, label cardinality (the average
number of labels per instance), distinct label set (the
number of distinct class configurations that appear in the
data), and data domain.

* The datasets are publicly available at 
http://mulan.sourceforge.net 

Dataset      N        m        d     LC     DLS      DM 
Mediamill    43,907   120      101   4.38   43,905   Video 
Bibtex       7,395    1,836    159   2.40   7,384    Text 
Reuters      6,000    47,236   101   2.88   5,990    Text 
Corel5k      5,000    499      374   3.52   4,999    Image 
Medical      978      1,449    45    1.24   58       Clinical 
Genbase      662      1,185    27    1.25   24       Biology 

Table 2: Dataset characteristics. (N: number of 
instances, m: number of features (observation), d: 
number of labels (response), LC: label cardinality, 
DLS: distinct label set, DM: domain) 

Creating Synthetic Outliers 

In this part of our experiments, we simulate plausible 
scenarios where responses can be outlying in given contexts, 
which are found virtually everywhere. For example, in 
semantic video/image labeling (Mediamill, Corel5k), a video 
clip or image may have irrelevant tags; in clinical diagnosis 
(Medical), a patient may receive an inaccurate diagnosis; 
and, in gene function analysis (Genbase), a gene sequence 
may be associated with wrong functional labels. 

To simulate them, we inject outliers into the response 
space by the following sequence: 

1: Bootstrap testing data with size 5,000 (optional). 

2: Perturb 0.5% of response variables uniformly at random, 
with no pre-selection or prioritization of either instances 
or response dimensions. 

After the perturbation process, we have a bootstrapped 
test dataset with < 0.5% of outliers. Note that the bootstrap 
step is optional; it is intended for the smaller datasets, on 
which only a few outliers would otherwise be injected and, 
hence, a proper statistical evaluation could not be performed. 
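
As an illustration, the injection sequence above can be sketched as follows. This is a minimal sketch under stated assumptions, not the exact procedure used in the experiments: the function name `inject_sparse_outliers` and its parameters are hypothetical; "perturbing" a binary response is assumed to mean flipping it; and the number of perturbed response variables is assumed to be 0.5% of the number of instances, which is one reading of Step 2 that keeps the fraction of outlying instances at or below 0.5%.

```python
import random


def inject_sparse_outliers(Y, rate=0.005, bootstrap_size=None, seed=0):
    """Flip a few binary responses chosen uniformly at random.

    Y is a list of rows, each a list of 0/1 labels. Returns the perturbed
    copy and a per-instance flag telling which rows were touched.
    """
    rng = random.Random(seed)

    # Step 1 (optional): bootstrap the test data, sampling rows with replacement.
    if bootstrap_size is not None:
        Yp = [list(rng.choice(Y)) for _ in range(bootstrap_size)]
    else:
        Yp = [list(row) for row in Y]

    n, d = len(Yp), len(Yp[0])

    # Step 2: choose response variables (cells) uniformly over all n * d cells,
    # with no preference for any instance or dimension.
    # ASSUMPTION: the number of perturbed cells is rate * n.
    n_flip = round(rate * n)
    is_outlier = [False] * n
    for cell in rng.sample(range(n * d), n_flip):
        i, j = divmod(cell, d)
        Yp[i][j] = 1 - Yp[i][j]  # ASSUMPTION: perturbing = flipping a binary label
        is_outlier[i] = True
    return Yp, is_outlier
```

Because two perturbed cells may fall on the same instance, the number of flagged instances can be smaller than the number of perturbed cells, which is why the resulting outlier fraction is only bounded from above.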

5.1.2 Methods 

We compare the performance of our approach with other 
widely used multivariate outlier detection methods, including 
the Robust Distance (RD) [43] approach, One-class SVM 
(OCSVM) [?] and Local Outlier Factor (LOF) [?]. To use 
these methods, we concatenate each observation and its 
associated responses into one vector, so that the methods can 
run over the joint space of all data attributes. To evaluate 
our multivariate conditional outlier detection (MCODE) 
approach, we use Dependent Binary Relevance (DBR) [?] 
as the base data model, and apply the five scoring metrics 
presented in Section 4.2.1. We refer to them with the 
following identifiers: MCODE-ComP uses the complementary 
probability score; MCODE-RD uses the Robust Distance 
[43] score; MCODE-L∞ uses the L∞ norm score; 
MCODE-OCSVM uses the one-class SVM [?] score; and 
MCODE-LOF uses the Local Outlier Factor [?] score. 

For a fair comparison, we fix the following parameters 
throughout all experiments. To train the SVM classifiers 
for OCSVM and MCODE-OCSVM, we use the radial basis 
function (RBF) kernel and set the OCSVM parameter 
ν = 0.01. For LOF and MCODE-LOF, the number of 
neighbors k is fixed to 30, as used in the original work [?]. 
We use L2-penalized logistic regression for DBR and choose 
the regularization parameter by cross validation. 

Lastly, recall that OCSVM is a semi-supervised method; 
to use it as a scoring metric (MCODE-OCSVM), we need 
to train a classifier that takes the posterior probabilities 
of the individual responses (p) as inputs. To avoid 
overfitting, the data used to train OCSVM should be a 
different subset from the data used to train the DBR model. 
To this end, in all experiments, we use only half of the 
training data to train DBR and hold out the rest for 
training OCSVM. 

AUC         Multivariate Methods                          Multivariate Conditional Outlier Detection 
Dataset     RD             LOF            OCSVM          MCODE-ComP     MCODE-RD       MCODE-L∞       MCODE-LOF      MCODE-OCSVM 
Mediamill   0.734 (0.187)  0.820 (0.045)  0.780 (0.031)  0.962 (0.019)  0.931 (0.025)  0.974 (0.017)  0.892 (0.040)  0.921 (0.022) 
Bibtex      0.501 (0.049)  0.807 (0.035)  0.512 (0.056)  0.839 (0.032)  0.977 (0.088)  0.888 (0.032)  0.930 (0.025)  0.762 (0.037) 
Reuters     0.512 (0.051)  1.000 (0.000)  0.538 (0.054)  0.903 (0.022)  0.978 (0.017)  0.960 (0.013)  1.000 (0.000)  0.823 (0.034) 
Corel5k     0.516 (0.059)  0.947 (0.031)  0.630 (0.062)  0.828 (0.029)  0.525 (0.086)  0.868 (0.029)  0.975 (0.018)  0.795 (0.038) 
Medical     0.516 (0.051)  1.000 (0.000)  0.562 (0.048)  0.963 (0.013)  0.633 (0.216)  0.965 (0.014)  1.000 (0.000)  0.936 (0.024) 
Genbase     0.512 (0.054)  1.000 (0.000)  0.848 (0.033)  0.986 (0.020)  0.975 (0.102)  0.986 (0.020)  0.998 (0.006)  0.987 (0.018) 
Rank        8.00 (0.00)    2.83 (2.11)    6.83 (0.40)    3.92 (1.02)    4.33 (2.34)    3.08 (1.20)    2.17 (1.17)    4.83 (1.44) 

Table 3: [Experiment 1] The mean and standard deviation (in parentheses) of the area under the receiver 
operating characteristic curve (AUC). The best methods (by paired t-test at α = 0.05) on each dataset are 
shown in bold. The last row shows the mean and standard deviation in the ranks of the methods (by the 
Friedman test followed by Holm's step-down procedure at α = 0.05). 

Figure 2: [Experiment 1] The comparisons of existing multivariate outlier detection methods (gray) and their 
use in our multivariate conditional outlier detection (MCODE) approach (green) in terms of the area under 
the receiver operating characteristic curve (AUC). The x-axis indicates different outlier detection methods. 
The y-axis indicates AUC. The red vertical bars show the standard deviation. Panels: (a) Mediamill, 
(b) Bibtex, (c) Reuters. 

5.1.3 Metric 

We use the area under the receiver operating characteristic 
curve (AUC) to evaluate the different methods. AUC is 
a single-number summary of the ROC curve, which plots 
the true positive rate (TPR) against the false positive 
rate (FPR) as the decision threshold is swept over the range 
of output scores. AUC is particularly useful when the optimal 
decision threshold is unknown. The higher the AUC, the 
better the performance. 
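
For concreteness, AUC can be computed without drawing the ROC curve explicitly, since it equals the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier (ties counted as one half). The following is a minimal sketch of that equivalence; the function name `roc_auc` is hypothetical, and the quadratic-time pairwise loop is chosen for readability rather than efficiency.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (outlier, inlier) pairs ranked correctly.

    scores: outlier scores, higher means more anomalous.
    labels: 1 for a true outlier, 0 for an inlier.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one outlier and one inlier")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # outlier correctly scored above the inlier
            elif p == q:
                wins += 0.5   # a tie counts as half a correct pair
    return wins / (len(pos) * len(neg))
```

For example, `roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])` ranks three of the four outlier-inlier pairs correctly and therefore yields 0.75, while a method that assigns the same score to every instance yields 0.5.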

5.1.4 Results 

Table 3 shows the area under the receiver operating 
characteristic curve (AUC) of the compared methods. We 
performed ten-fold cross validation with three repeats for 
all of the datasets. The mean and standard deviation (in 
parentheses) over the 30 runs are reported. On each dataset, 
we mark the best methods and their statistically equivalent 
methods (by paired t-tests at the 0.05 significance level) in 
bold. The last row shows the mean and standard deviation 
of the ranks of the methods, computed by the Friedman 
test followed by Holm's step-down procedure at the 0.05 
significance level. Again, the statistically superior 
methods are marked in bold. 
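
To make the rank row concrete, the following sketch ranks the eight methods on each dataset by their mean AUC (rank 1 is best, and tied methods share the average of their ranks) and then averages the ranks over the six datasets. The mean AUC values are copied from Table 3; the helper names are hypothetical, and "MCODE-Linf" is an ASCII stand-in for MCODE-L∞. The sketch covers only the rank computation, not the Friedman test statistic or Holm's step-down procedure.

```python
METHODS = ["RD", "LOF", "OCSVM", "MCODE-ComP", "MCODE-RD",
           "MCODE-Linf", "MCODE-LOF", "MCODE-OCSVM"]

# Mean AUC per dataset, copied from Table 3 (values listed in METHODS order).
MEAN_AUC = {
    "Mediamill": [0.734, 0.820, 0.780, 0.962, 0.931, 0.974, 0.892, 0.921],
    "Bibtex":    [0.501, 0.807, 0.512, 0.839, 0.977, 0.888, 0.930, 0.762],
    "Reuters":   [0.512, 1.000, 0.538, 0.903, 0.978, 0.960, 1.000, 0.823],
    "Corel5k":   [0.516, 0.947, 0.630, 0.828, 0.525, 0.868, 0.975, 0.795],
    "Medical":   [0.516, 1.000, 0.562, 0.963, 0.633, 0.965, 1.000, 0.936],
    "Genbase":   [0.512, 1.000, 0.848, 0.986, 0.975, 0.986, 0.998, 0.987],
}


def average_ranks(values):
    """Rank values in descending order (1 = best); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def mean_ranks(auc_by_dataset):
    """Average each method's rank over all datasets."""
    per_dataset = [average_ranks(v) for v in auc_by_dataset.values()]
    return [sum(col) / len(per_dataset) for col in zip(*per_dataset)]


MEAN_RANK = dict(zip(METHODS, mean_ranks(MEAN_AUC)))
```

Under these assumptions, the mean ranks obtained from the tabulated mean AUCs agree with the last row of Table 3 after rounding to two decimals (for instance, 2.17 for MCODE-LOF and 2.83 for LOF).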

We can see that our approach consistently produces 
competitive AUC scores. For example, MCODE-LOF 
outperforms the other methods on four datasets, while 
MCODE-RD and MCODE-L∞ each outperform the other 
methods on one dataset. Among the baseline methods, LOF is 
a close competitor. It produces the best AUCs on three 
datasets and competitive AUCs on the remaining three. We 
attribute this to the process of its relative density estimation 
(Equation (10)). That is, the computation of local densities 
in LOF can be understood as an estimation of likelihood 
conditioned on local information. As a result, LOF can 
effectively approximate the conditional probability 
estimation. On the other hand, RD and OCSVM do not 
seem to handle the multi-dimensional data properly. Their 
unconditional approaches, which identify outliers over the 
joint space of all data attributes, do not show much efficacy. This 
is partially due to the high-dimensionality of the data in 

One way to organize the results and analyze the benefits 
of our approach is to directly compare each baseline with 
its counterpart in MCODE. Figure 2 compares RD, LOF 
and OCSVM from this perspective. The y-axis indicates 
AUC. The x-axis indicates different methods. The results 
are grouped by the scoring techniques, where the gray bars 
show the baseline results and the green bars show those of 
MCODE. We can see a significant improvement from the 
baseline to MCODE, especially for RD and OCSVM. 
Although, as described above, the performance of LOF is 
already good as is, MCODE-LOF improves the AUC scores 
even further by working directly on the conditional 
probability space. 

In summary, the experimental results demonstrate that 
our MCODE methods, which transform testing data from 
the original space to the conditional probability space, 
help in the identification of outliers and, hence, improve 
the results. 

Figure 3: [Experiment 2] The changes in the area under the precision-recall curve (AUC-PR) according to the 
different outlier injection rates. The x-axis indicates outlier injection rates. The y-axis indicates AUC-PR. 
Panels: (a) AUC-PR on Mediamill, (b) AUC-PR on Bibtex, (c) AUC-PR on Corel5k. 

5.2 Experiment 2 

5.2.1 Data 

In the second part of our experiments, we test the 
sensitivity of the methods to the number of outlying 
dimensions; i.e., we move from sparse outliers (each outlying 
instance has one or very few outlying dimensions) to denser 
outliers (each manifests multiple outlying dimensions), and 
test how well each method performs along with this change. 

We use three of the multi-dimensional datasets: Mediamill 
[46] (video annotation), Bibtex (text categorization) and 
Corel5k (image labeling). See Table 2 for the 
characteristics of the datasets. 

Creating Synthetic Outliers 

In this part, we use a more controlled setting, where 
we can adjust the number of outlying dimensions. Note 
that this can be a very useful testing protocol in practice, 
especially for problems where experts are involved in 
data labeling (e.g., making clinical decisions). 

To simulate such scenarios, we inject outliers into the re¬ 
sponse space by the following sequence: 

1: Bootstrap testing data with size 5,000 (optional). 

2: Select 0.5% of instances uniformly at random. 

3: For each selected instance, select p response dimensions 
uniformly at random; perturb the values in the selected 
dimensions. 


After a perturbation process, we will have a bootstrapped 
test dataset with exactly 0.5% of outlier instances, where 
each outlier has p outlying dimensions. 
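
The controlled injection can be sketched in the same spirit. This is an illustrative sketch only: the function name `inject_dense_outliers` and its parameters are hypothetical, and perturbing a binary response is again assumed to mean flipping it.

```python
import random


def inject_dense_outliers(Y, p, rate=0.005, bootstrap_size=None, seed=0):
    """Turn a fixed fraction of instances into outliers with p flipped labels each.

    Y is a list of rows, each a list of 0/1 labels. Returns the perturbed
    copy and a per-instance outlier flag.
    """
    rng = random.Random(seed)

    # Step 1 (optional): bootstrap the test data, sampling rows with replacement.
    if bootstrap_size is not None:
        Yp = [list(rng.choice(Y)) for _ in range(bootstrap_size)]
    else:
        Yp = [list(row) for row in Y]

    n, d = len(Yp), len(Yp[0])
    if not 1 <= p <= d:
        raise ValueError("p must be between 1 and the number of response dimensions")

    # Step 2: select the outlying instances uniformly at random, without replacement.
    is_outlier = [False] * n
    for i in rng.sample(range(n), round(rate * n)):
        # Step 3: select p distinct response dimensions and perturb them.
        for j in rng.sample(range(d), p):
            Yp[i][j] = 1 - Yp[i][j]  # ASSUMPTION: perturbing = flipping a binary label
        is_outlier[i] = True
    return Yp, is_outlier
```

Sweeping p from 1 upward while keeping the rate fixed moves the test set from sparse to dense outliers without changing the number of outlying instances.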

5.2.2 Metric 

We use the area under the precision-recall (PR) curve 
(AUC-PR). Similar to the AUC score, AUC-PR is a 
single-number summary of the PR curve. While the score is 
relatively more conservative than AUC, it is useful for 
depicting the sensitivity of methods, particularly when the 
target distribution is imbalanced, as in outlier detection tasks. 
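
One common way to summarize the PR curve by a single number is average precision, i.e., the mean of the precision values measured at the rank of each true outlier. The sketch below uses this estimator as an assumption, since the exact AUC-PR estimator is not specified here; the function name `average_precision` is hypothetical.

```python
def average_precision(scores, labels):
    """Average precision: mean precision at the rank of each true outlier.

    scores: outlier scores, higher means more anomalous.
    labels: 1 for a true outlier, 0 for an inlier.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision needs at least one outlier")

    # Sort by decreasing score; sorted() is stable, so tied scores keep input order.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])

    hits = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k  # precision among the top-k scored instances
    return total / n_pos
```

Unlike AUC, this quantity depends on the class ratio: with very few outliers, even a handful of highly scored inliers lowers the precision markedly, which is why the score is the more conservative of the two.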

5.2.3 Results 

Figure 3 shows the AUC-PR of the methods. We 
performed ten-fold cross validation with three repeats for 
all experiments. The y-axis indicates AUC-PR. The x-axis 
indicates the number of outlying dimensions. We use 
different colors and line styles (solid or dotted) to indicate 
different methods. Simply put, the dotted lines show the 
AUC-PR of the MCODE methods, which are superior in 
general, whereas the solid lines show that of the baselines. 

Intuitively, the fewer the outlying dimensions, the harder 
the outliers are to detect. Such trends are well captured 
in the figure. Most methods start from the bottom 
quarter of the plots and gradually improve as the number 
of outlying dimensions increases. However, we can see that 
the MCODE methods usually start at relatively higher 
AUC-PRs. As the number of outlying dimensions increases, 
the differences become more obvious. That is, the AUC-PRs 
of MCODE grow rapidly, while those of the baseline methods 
grow more slowly (OCSVM) or seem invariant (RD and 
LOF) up to this small number of outlying dimensions. 

In summary, this part of our experiments verifies that 
exploiting the piecewise posterior response probability not 
only helps to improve the outlier detection performance in 
general, but also makes the methods more sensitive to small 
degrees of perturbation. 

6. CONCLUSIONS 

We studied a special case of the outlier detection problem 
where outliers are context dependent and are defined by 
unusual combinations of multiple outcome variable 
values. We reviewed existing outlier detection approaches 
and multi-dimensional learning methods and presented a 
new conditional outlier detection approach for multivariate 
outcome spaces. The key motivation of our approach is that 
we can transform the conditional outlier detection problem 
to an unconditional space and solve it more effectively there. 
Accordingly, we defined five outlier scoring metrics by 
analyzing the data in the new space. Experiments on two 
outlier detection settings demonstrate that our approach is 
not only competitive, but also sensitive to sparse outliers. 